Abstract


The Winograd Schema Challenge is both a com- 
monsense reasoning and natural language under- 
standing challenge, introduced as an alternative to 
the Turing test. A Winograd schema is a pair of 
sentences differing in one or two words with a 
highly ambiguous pronoun, resolved differently in 
the two sentences, that appears to require common- 
sense knowledge to be resolved correctly. The ex- 
amples were designed to be easily solvable by hu- 
mans but difficult for machines, in principle requir- 
ing a deep understanding of the content of the text 
and the situation it describes. This paper reviews 
existing Winograd Schema Challenge benchmark 
datasets and approaches that have been published 
since its introduction. 


1 Introduction 


The Winograd Schema Challenge was introduced by Hector 
Levesque [Levesque et al., 2012] both as an alternative to the 
Turing Test [Turing, 1950] and as a test of a system’s ability 
to do commonsense reasoning. 

An example of a Winograd schema is the pair of sentences 
introduced by Terry Winograd [1972]: 


The city councilmen refused the demonstrators a permit be- 
cause they [feared/advocated] violence. 

Question: Who [feared/advocated] violence? 

Answer: the city councilmen / the demonstrators 


The word they refers to the city councilmen or the demon- 
strators, depending on whether the word feared or advocated 
is used in the sentence. To correctly identify the referent, 
a human would probably need to know a good deal about 
demonstrators, permits, city councilmen, and demonstrations. 

Levesque’s insight was that one can construct many other
pairs of sentences that appear to rely on commonsense
reasoning and for which sentence structure does not help
disambiguate the sentence. He claimed that a system that could
achieve human performance in solving such sentences would
have the commonsense knowledge that humans use when doing
such disambiguation. Such pairs of sentences would have
to be constructed to have certain properties, including being
identical except for one or two “special” words and not being
solvable by selectional restriction. An important constraint
was that the Winograd schemas be “Google-proof” or
non-associative [Trichelair et al., 2018], meaning that one could
not use statistical associations to disambiguate the pronouns.
As we discuss below, this is the least achievable constraint,
as indicated by the success of statistical language models
described in the survey.

The Winograd Schema Challenge was appealing, because 
the task of pronoun disambiguation is easy and automatic for 
humans, the evaluation metrics were clear, and the trick of us- 
ing twin sentences seemed to eliminate using structural tech- 
niques to get to the right answer in ways that avoided using 
commonsense reasoning. In the years following its publica- 
tion, the challenge became a focal point of research for both 
the commonsense reasoning and natural language processing 
communities. 

A great deal of progress has been made in the last year. In 
this paper, we review the nature of the test itself, its different 
benchmark datasets, and the different techniques that have 
been used to address them. 


2 Winograd Schema Challenge Datasets 


Several Winograd Schema Challenge datasets have been in- 
troduced; for the most part, they can be split into two cate- 
gories: performance-measuring and diagnostic datasets. 


2.1 Original Collection of Winograd Schemas 


The first collection of 100 Winograd schemas was published
together with the introduction of the Winograd Schema
Challenge [Levesque et al., 2012]. Examples are constructed
manually by AI experts, with the exact source for each
example available. At the time of writing, there are 285
examples available; however, the last 12 examples were only
added recently. To ensure consistency with earlier models,
several authors prefer to report the performance on the
first 273 examples only. These datasets are usually referred
to as WSC285 and WSC273, respectively.

Subclasses of the original collection. Trichelair et
al. [2018] observed that 37 sentences in the WSC273
dataset (13.6%) are conceptually easier than the rest. The
correct candidate is commonly associated with the rest of the
sentence, while the incorrect candidate is not. An example of
such a sentence is

In the storm, the tree fell down and crashed through the 
roof of my house. Now, I have to get it [repaired/removed]. 

The roof is commonly associated with being repaired, 
while the tree is not. They call these examples associative 
and name the rest non-associative. Moreover, they find that 
models often perform much better on the associative subsets. 

Additionally, 131 sentences (48% of WSC273) were found 
to form meaningful examples if the candidates in the sentence 
are switched. An example of such sentence is 

Bob collapsed on the sidewalk. Soon he saw Carl coming 
to help. He was very [ill/concerned]. 

In this sentence, Bob and Carl can be switched to ob- 
tain an equivalent example with the opposite answers. Such 
sentences were named switchable. Trichelair et al. [2018] 
encourage future researchers to additionally report the con- 
sistency on the switchable dataset, when the candidates are 
switched, and when they are not. 
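
The consistency check described above can be sketched in a few lines. This is a minimal illustration, not the evaluation code of Trichelair et al. [2018]: it assumes that consistency means the fraction of switchable examples on which the predicted name flips once the two candidates are swapped. The `SwitchablePair` container and the `switch_consistency` function are hypothetical names introduced for this sketch.

```python
from typing import List, NamedTuple, Tuple


class SwitchablePair(NamedTuple):
    candidates: Tuple[str, str]  # the two candidates, e.g. ("Bob", "Carl")
    pred_original: str           # candidate chosen for the original sentence
    pred_switched: str           # candidate chosen after swapping the names


def switch_consistency(pairs: List[SwitchablePair]) -> float:
    """Fraction of pairs whose prediction flips when the candidates are switched."""
    if not pairs:
        raise ValueError("need at least one switchable pair")
    consistent = 0
    for pair in pairs:
        first, second = pair.candidates
        # Swapping the names swaps their roles, so a consistent system
        # must now pick the other name.
        expected = second if pair.pred_original == first else first
        if pair.pred_switched == expected:
            consistent += 1
    return consistent / len(pairs)


pairs = [
    SwitchablePair(("Bob", "Carl"), "Bob", "Carl"),  # prediction follows the role
    SwitchablePair(("Bob", "Carl"), "Bob", "Bob"),   # prediction sticks to the name
]
print(switch_consistency(pairs))  # 0.5
```

A system that always prefers the same name regardless of its role scores 0 under this definition, even though its accuracy on the two variants averages 50%.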

The list of associative and switchable examples, together
with their switched counterparts, has been made public.


Winograd Schema Challenge in other languages. While
the inspiration and original design of the challenge were in
English, translations into other languages exist. Amsili and
Seminck [2017] translated a collection of 144 Winograd
schemas into French, and Melo et al. [2020] translated the
285 original Winograd schemas into Portuguese.
Additionally, translations into Japanese and Chinese are
available on the official webpage of the challenge. The authors
of the French and Portuguese translations both report having
to change some of the content to avoid unintended cues, such as
grammatical gender. In the case of Portuguese, 8 sentences had
to be dropped, as no appropriate translation could be found.


2.2 Definite Pronoun Resolution Dataset 


The Definite Pronoun Resolution (DPR) dataset is an easier 
variation of the Winograd Schema Challenge [Rahman and 
Ng, 2012]. The constraints on the Winograd schemas have 
been relaxed, and several examples in the dataset are not 
Google-proof. The dataset consists of 1322 training examples 
and 564 test examples, constructed manually. 6 examples in 
the training set reappear in WSC273 in a very similar form. 
These should be removed when training on DPR and evaluat- 
ing on WSC273. This dataset is also referred to as WSCR, as 
named by Opitz and Frank [2018]. 

An expanded version of this dataset, called WINOCOREF,
has been released by Peng et al. [2015], who further annotate
all mentions in the sentences that were not annotated in the
original work (in their work, a mention can be either a pronoun
or an entity). In this way, they add 746 mentions to the
dataset, 709 of which are pronouns. Moreover, Peng et al.
[2015] argue that accuracy is not an appropriate metric of
performance on the WINOCOREF dataset and introduce a new one,
called AntePre.

They define AntePre as follows: Suppose there are $k$
pronouns in the dataset, and the pronouns have
$n_1, \ldots, n_k$ antecedents, respectively. We can treat
finding the correct candidate for each pronoun as a binary
classification for each antecedent-pronoun pair. Let $m$ be
the number of correct binary decisions. AntePre is then
computed as $m / \sum_{i=1}^{k} n_i$.
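
As an illustration of the definition above, the following sketch computes AntePre from per-pair binary decisions. The nested-list input layout and the function name `ante_pre` are assumptions made for this example, not part of Peng et al. [2015].

```python
from typing import Sequence


def ante_pre(decisions: Sequence[Sequence[bool]]) -> float:
    """Compute AntePre from binary antecedent-pronoun decisions.

    decisions[i][j] is True when the binary decision for the j-th
    antecedent of the i-th pronoun was correct, so len(decisions[i])
    plays the role of n_i.
    """
    total_pairs = sum(len(per_pronoun) for per_pronoun in decisions)
    if total_pairs == 0:
        raise ValueError("no antecedent-pronoun pairs to score")
    correct = sum(sum(per_pronoun) for per_pronoun in decisions)  # this is m
    return correct / total_pairs


# Two pronouns with 2 and 3 candidate antecedents; 4 of 5 decisions correct.
print(ante_pre([[True, False], [True, True, True]]))  # 0.8
```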


2.3 Pronoun Disambiguation Problem Dataset 


The Pronoun Disambiguation Problem (PDP) dataset con- 
sists of 122 problems of pronoun disambiguation collected 
from classic and popular literature, newspapers, and maga- 
zines. Because constructing Winograd schemas according to 
Levesque’s original guidelines was a difficult, manual pro- 
cess, PDPs, which were collected and vetted rather than con- 
structed, were intended to be used as a gateway set before 
administration of the Winograd Schema Challenge [Morgen- 
stern et al., 2016]. Each PDP, once collected, was vetted (and 
sometimes modified) to ensure that like Winograd schemas, 
the problems were of the sort that humans use commonsense 
knowledge to disambiguate, and were “Google-proof.” Al- 
though each PDP was vetted to remove examples where sen- 
tence structure would help find the answer, there was no “spe- 
cial” word, and thus, unlike Winograd schemas, no guarantee 
that sentence structure could not be exploited. PDPs were 
therefore expected to be easier than Winograd schemas. 

Example: Do you suppose that Peter is responsible for the 
captain’s illness? Maybe he bribed the cook to put something 
in his food. 

The referent of his is: (a) Peter or (b) the captain. 

62 examples of PDPs were published before the Winograd
Schema Challenge was administered, and 60 PDPs were
included in the Winograd Schema Challenge that was
administered at IJCAI 2016 [Davis et al., 2017]. A corpus of 400
sentences was collected semi-automatically from online text,
with less vetting, by Davis and Pan [2015].


2.4 Winograd Natural Language Inference Dataset


The Winograd Natural Language Inference (WNLI) dataset 
is part of the GLUE benchmark [Wang et al., 2019b] and is 
a textual entailment variation of the Winograd Schema Chal- 
lenge. An example from WNLI is given below with the goal to 
determine whether the hypothesis follows from the premise. 


Premise: The city councilmen refused the demonstrators a 
permit because they feared violence. 

Hypothesis: The demonstrators feared violence. 

Answer: true / false 


> http://commonsensereasoning.org/disambiguation.html 
The dataset consists of 634 training examples, 70 validation
examples, and 145 test examples. Training and validation
sets overlap substantially with the WSC273 dataset,
while test samples come from a previously unreleased
collection of Winograd schemas. Not all examples in this dataset
contain the special word, and such examples therefore do not
come in pairs. Kocijan et al. [2019b] note that the examples
are much easier to approach if they are transformed from
textual entailment back into a pronoun resolution problem
and approached as such.

The same collection of examples is used for the SuperGLUE
benchmark [Wang et al., 2019a], where it is posed as a pronoun
resolution problem to begin with. For the purpose of this
survey paper, WNLI and SuperGLUE WSC are considered the
same dataset. They consist of the same examples, and all
approaches to WNLI described in this paper transform the
examples as noted in the previous paragraph.


2.5 WinoGender Dataset 


Unlike the previous datasets, WINOGENDER was created as
a diagnostic dataset that aims to measure the gender bias of
pronoun resolution systems [Rudinger et al., 2018].
WINOGENDER consists of 120 hand-written sentence templates,
together with candidates and pronouns that can be inserted
into the templates to create valid sentences.

In each sentence, one of the candidates is an occupation,
usually one with a high imbalance in gender ratio (e.g.,
surgeon). The other candidate is a participant (e.g., patient) or
a neutral someone. For each sentence, any of the pronouns
he, she, or they can be inserted to create a valid sentence, as
the candidates are gender-neutral. Altogether, this gives 720
Winograd schemas. An example from the dataset is

The surgeon operated on the child with great care; 
[his/her] [tumor/affection] had grown over time. 

Note that the gender of the pronoun does not affect the ex- 
pected answer; however, a biased system that associates the 
pronoun his with the surgeon is likely to answer one of them 
incorrectly. The aim of this dataset is not to measure model 
performance, as its data distribution is highly skewed, but to 
help analyse the models for gender bias. 
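
The count of 720 can be reproduced with a small template-expansion sketch. It assumes that each of the 120 templates is filled with three pronouns and two participant variants (a named participant or the neutral "someone"), which is consistent with the numbers above but not spelled out in this survey. The template string and the `expand_template` helper are hypothetical and written for this example only.

```python
from itertools import product
from string import Template
from typing import List

PRONOUNS = ("he", "she", "they")


def expand_template(template: str, occupation: str, participant: str) -> List[str]:
    """Fill one template with every pronoun and both participant variants."""
    pattern = Template(template)
    participants = (f"the {participant}", "someone")
    return [
        pattern.substitute(occupation=occupation, participant=who, pronoun=pronoun)
        for pronoun, who in product(PRONOUNS, participants)
    ]


# Hypothetical template, not taken from the dataset.
sentences = expand_template(
    "The $occupation told $participant that $pronoun could be reached by phone.",
    occupation="technician",
    participant="customer",
)
for sentence in sentences:
    print(sentence)

# 120 templates x 3 pronouns x 2 participant variants
print(120 * len(sentences))  # 720
```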


2.6 WinoBias Dataset 


WINOBIAS was created by Zhao et al. [2018] to identify
gender bias in pronoun resolution models. WINOBIAS and
WINOGENDER were created concurrently but independently,
despite sharing the same objective. Zhao et al. introduce a
dataset with 3,160 sentences, split equally into development
and test sets. Each sentence contains two candidates that are
selected from a list of jobs with a highly imbalanced gender
ratio.

Two different templates are used to create Winograd 
schemas. Type 1 sentences follow a structure that does not 
give away any syntactic cues. The authors thus estimate these 
sentences to be more challenging. An example of such a sen- 
tence is 

The farmer knows the editor because [he/she] [is really 
famous/likes the book]. 

Type 2 sentences can be answered based on the structure of 
the sentence. The authors thus expect the models to perform 
better. An example of such a sentence is 


The accountant met the janitor and wished [her/him] well. 

Its “twin pair” has the candidates swapped. As the struc- 
ture of the sentence gives the answer away, there is no spe- 
cial word. 

Moreover, the authors evenly split the whole dataset
into pro-stereotypical and anti-stereotypical subsets, depending
on whether the gender of the pronoun matches the most
common gender of the referent occupation or not. They observe
that publicly available models for co-reference resolution
exhibit a major difference (up to 21.1% F1) in performance on
the pro- and anti-stereotypical subsets of the dataset.


2.7 WinoGrande Dataset 


The WINOGRANDE dataset is a large-scale Winograd Sche- 
ma Challenge dataset (44k examples) [Sakaguchi et al., 
2020] collected via crowdsourcing on Amazon Mechanical 
Turk. To prevent the crowd from creating lexically and stylis- 
tically repetitive examples, the workers are primed by a ran- 
domly chosen topic from a WikiHow article as a suggestive 
context. Finally, the authors use an additional crowd of work- 
ers to ensure that the sentences are hard but not ambiguous to 
humans. These measures were taken to ensure that there is no 
instance-level bias that models could exploit. 

However, checking for instance-level cues is often not
enough, as models tend to pick up on dataset-level biases. The
authors therefore additionally introduce the AFLITE adversarial
filtering algorithm. They use a fine-tuned RoBERTa language
model [Liu et al., 2019] to obtain contextualized embeddings
for each instance. Using these embeddings, they iteratively
train an ensemble of linear classifiers on random subsets of
the data and discard the top-k instances that were correctly
resolved by more than 75% of the classifiers. By iteratively
applying this algorithm, the authors identify a subset of 12,282
instances, called WINOGRANDE_debiased. Finally, they split
this dataset into training (9,248), development (1,267), and
test (1,767) sets. They also released the unfiltered training
set, WINOGRANDE_all, with 40,938 examples.
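
The filtering loop described above can be sketched as follows. This is a simplified illustration under several assumptions, not the AFLITE implementation of Sakaguchi et al. [2020]: a nearest-class-mean classifier stands in for the linear classifiers, plain lists of floats stand in for the RoBERTa embeddings, and the stopping rule (stop when no instance exceeds the threshold or when at most `min_size` instances remain) as well as all parameter names and defaults except the 75% threshold are choices made for this sketch.

```python
import random
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Predictor = Callable[[Vector], int]


def train_centroid_classifier(xs: List[Vector], ys: List[int]) -> Predictor:
    """Stand-in linear classifier: predict the label of the nearest class mean."""
    means: Dict[int, List[float]] = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        means[label] = [sum(column) / len(members) for column in zip(*members)]

    def predict(x: Vector) -> int:
        return min(
            means,
            key=lambda label: sum((a - b) ** 2 for a, b in zip(x, means[label])),
        )

    return predict


def aflite_sketch(
    embeddings: List[Vector],
    labels: List[int],
    n_models: int = 20,
    train_fraction: float = 0.5,
    threshold: float = 0.75,
    top_k: int = 2,
    min_size: int = 4,
    seed: int = 0,
) -> List[int]:
    """Return the indices of the instances that survive the filtering."""
    rng = random.Random(seed)
    kept = list(range(len(embeddings)))
    while len(kept) > max(min_size, 2):
        correct = {i: 0 for i in kept}
        seen = {i: 0 for i in kept}
        for _ in range(n_models):
            # Train on a random subset and evaluate on the held-out rest.
            train_size = max(2, int(train_fraction * len(kept)))
            train = set(rng.sample(kept, train_size))
            predict = train_centroid_classifier(
                [embeddings[i] for i in kept if i in train],
                [labels[i] for i in kept if i in train],
            )
            for i in kept:
                if i not in train:
                    seen[i] += 1
                    correct[i] += int(predict(embeddings[i]) == labels[i])
        # Share of classifiers that resolved each held-out instance correctly.
        scores = {i: correct[i] / seen[i] for i in kept if seen[i]}
        easy = sorted(
            (i for i in scores if scores[i] > threshold),
            key=scores.get,
            reverse=True,
        )[:top_k]
        if not easy:
            break
        kept = [i for i in kept if i not in easy]
    return kept


# Toy data: labels are perfectly predictable from the single embedding dimension.
toy_embeddings = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(aflite_sketch(toy_embeddings, toy_labels))
```

Instances whose label is easy to predict from the embedding alone are removed first, which is the sense in which the filter targets dataset-level rather than instance-level bias.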


2.8 WinoFlexi Dataset 


Similarly to WINOGRANDE, Isaak and Michael [2019] aim to
construct a dataset through crowdsourcing. They build their
own system and collect 135 pairs of Winograd schemas (270
examples). Unlike workers on WINOGRANDE, workers on
WINOFLEXI are not presented with any particular topic and
are free to pick one on their own. Despite this, the authors
find the collected schemas to be of decent quality, achieved
through manual supervision between workers.


3 Approaches to Winograd Schema Challenge 


At least three different methods have been used to try to solve 
the Winograd Schema Challenge. One class of approaches 
consists of feature-based approaches, typically extracting in- 
formation such as semantic relations. Additional common- 
sense knowledge is usually included in form of explicitly 
written rules from knowledge bases, web searches, or word 
co-occurrences. The collected information is then used to 
make a decision, using rule-based systems, various types of 
logics, or discrete optimization algorithms. We observe that
the extraction of relevant information from the sentence is
usually the bottleneck of these approaches. Given the nature 
of the challenge, even the slightest noise in the feature collec- 
tion can make the problem unsolvable. 

The second group of approaches are neural approaches, 
excluding language-model-based approaches, which we con- 
sider as a separate group. Neural-network-based approaches 
usually read the sentence as a whole, removing the bottle- 
neck of information extraction. To incorporate background 
information, these networks or their components are usually 
pre-trained on unstructured data, usually unstructured text, 
or other datasets for coreference resolution. Common ap- 
proaches to the tasks in this group take advantage of semantic 
similarities between word embeddings or use recurrent neural 
networks to encode the local context. We find this group of 
approaches to lack reasoning capabilities, as semantic simi- 
larity or local context usually do not contain sufficient infor- 
mation to solve Winograd schemas. 

The third group includes approaches that make use of 
large-scale pre-trained language models, trained with deep 
neural networks, extensively pre-trained on large corpora of 
text. Some of the approaches then additionally fine-tune the 
model on Winograd-Schema-Challenge-style data to maxi- 
mize their performance. Approaches in this group achieve 
visibly better performance than approaches from the first two 
groups. 

Due to the scattered nature of the results, we decided not to
combine them into one large table. Not all methods are
evaluated on the same set of examples. Moreover, choices that
are not crucial to the idea, such as the choice of word
embeddings or language model, can significantly affect the
results, making direct comparison unfair.


3.1 Feature-based Approaches 


This section covers the approaches that collect knowledge in
the form of explicit rules from knowledge bases or internet
search queries and use logic-based systems or optimization
techniques to deduce the answer. We emphasize that the results
of methods that rely on search engines, such as Google, can be
irreproducible, as they strongly depend on the search results.
The first model was introduced by Rahman and Ng [2012] 
together with the DPR dataset. The features that consist 
of Google queries, narrative chains, and semantic compat- 
ibility, were used to rank candidates with an SVM-based 
ranker. Their approach achieved 73.05% accuracy on the 
DPR test set. Peng et al. [2015] achieved 76.41% accuracy 
on this same dataset using integer linear programming and 
manually constructed schemas to learn conditioning from un- 
structured text. They additionally evaluated their model on 
WINOCOREF, where they achieved 89.32 AntePre score. 
Sharma et al. [2015] constructed a general-purpose
semantic parser and used it to parse and answer Winograd
schemas. The parser is used to extract relevant information
from the sentence and internet search queries. Answer
set programming (ASP) [Gelfond and Lifschitz, 1988;
Baral, 2003] is then used to define the rules and constructs
for reasoning. Due to their focus on specific types of
reasoning, the authors only evaluate their approach on 71 examples
from WSC285 where such reasoning is present, with their
system correctly answering 49 examples (69% accuracy).
As noted by Zhang and Song [2018], this same approach 
achieves 50% accuracy on a different subset of 92 examples. 
Isaak and Michael [2019] report this system to correctly solve 
38% of WINOFLEXI examples, to incorrectly solve 36%, and 
to make no decision on the remaining examples. As shown by 
Sharma [2019], the sentence parsing and the knowledge col- 
lection are the bottleneck of this process. Sharma [2019] de- 
velops an ASP-based algorithm, called WISCR, which cor- 
rectly solves 240 out of 285 WSC285 examples, if the input 
and background knowledge are provided by a human. On the 
other hand, this same algorithm only solves 120 of the exam- 
ples, if it uses K-Parser for input parsing and a search engine 
for knowledge hunting. 

Emami et al. [2018] developed the first model to achieve
better-than-chance accuracy (57.1%) on the entire WSC273.
Their system is completely rule-based and focuses on
high-quality knowledge hunting, rather than reasoning, showing
the importance of the former. Unlike neural approaches from
later sections, this model is not negatively affected by
switching candidates.

Isaak and Michael [2016] take a similar approach and use a
collection of heuristics and external systems for text
processing, information extraction, and reasoning. The final system
correctly resolves 170 of the 286 examples from an older
collection of Winograd schemas and 59% of WINOFLEXI.

An interesting approach to reasoning was proposed by 
Fähndrich et al. [2018], who build a graph for each example
by combining knowledge about words from several knowl- 
edge bases with semantic and syntactic information. They 
place a collection of markers on the pronoun and iteratively 
distribute them across the graph according to a manually de- 
signed set of rules. The candidate with the greatest number of 
markers after n steps is considered the answer. The approach 
is evaluated on PDP, where it obtains 74% accuracy. 


3.2 Neural Approaches 


This section contains approaches that rely on neural networks 
and deep learning, but do not use pre-trained language mod- 
els. Models in this section are usually designed, built, and 
trained from scratch, while models that use language models 
are usually built on top of an off-the-shelf pre-trained neural 
network. We find that several ideas introduced in this section 
are later adjusted and scaled to language models; see Sec- 
tion 3.3. Note that each work comes with a collection of mo- 
del-specific architecture designs that are not covered in detail. 

Liu et al. [2017a] were the first to use neural networks
to approach the challenge. They introduce a neural
association model to model causality and automatically construct a
large collection (around 500,000) of cause-effect pairs, which
are used to train the model. The model is then trained to
predict whether the second part of the schema is the consequence
of the first one. For evaluation, Liu et al. [2017a] manually
select 70 Winograd schemas from the WSC273 dataset that
rely on cause-effect reasoning. Their best model achieves
70% accuracy on this selected subset. In their subsequent
work, Liu et al. [2017b] extend this approach and use it at
the Winograd Schema Challenge 2016 [Davis et al., 2017].
They develop their own pre-trained word embeddings, whose
semantic similarity should correlate with cause-effect pairs,
and train the final model on the OntoNotes dataset for
coreference resolution [Hovy et al., 2006]. This method achieved a
final score of 58.3% on the PDP dataset and 52.8% on WSC273.

Zhang and Song [2018] similarly try to augment word em- 
beddings that can take advantage of dependencies in the sen- 
tence. Unlike Liu et al. [2017b], their model is completely 
unsupervised and is not additionally trained on any labelled 
data. They modify the Skip-Gram objective for word embed- 
ding pre-training to additionally use and predict semantic de- 
pendencies, which can thus be used as additional information 
at test time. The introduced approach is tested on a manually 
selected set of 92 easy Winograd schemas from the WSC273 
dataset, achieving a 60.33% accuracy. Wang et al. [2019c] 
take a step further with the unsupervised deep semantic simi- 
larity model (UDSSM). Instead of augmenting the word em- 
bedding, they train BiLSTM modules to compute contextu- 
alized word embeddings. The best performing ensemble of 
their models achieves 78.3% accuracy on PDP and 62.4% ac- 
curacy on WSC273. 

Opitz and Frank [2018] are the first to try to generalize 
from DPR to WSC273 by training on the former and testing 
on the latter. We note that authors do not mention removing 
the overlap between them. In their approach, they replace the 
pronoun in question with one of the candidates. They design 
several Bi-LSTM-based models and train them to rank the 
sentence with the correct candidate better than the sentence 
with the incorrect candidate. Their best approaches achieve 
63% on DPR and an accuracy of 56% on WSC273, showing 
that generalizing from DPR to WSC273 is not trivial. 


3.3 Language Model Approaches 


This section covers the approaches that use neural language 
models to tackle the Winograd Schema Challenge. Most of 
them use one or more language models that were trained on 
a large corpus of text. Several authors use large pre-trained 
language models, such as BERT [Devlin et al., 2019], and 
have to tailor their approach accordingly. Many works thus 
focus on better fine-tuning of such language models instead 
of inventing new architectures. 

Trinh and Le [2018] were the first to use pre-trained lan- 
guage models. Similarly to Opitz and Frank [2018], two sen- 
tences are created from each example by replacing a pronoun 
with each of the two candidates. A language model, imple- 
mented as an LSTM and pre-trained on a large corpus of text 
is used to assign them a probability. The ensemble of 14 such 
language models obtained was evaluated on the PDP (70% ac- 
curacy) and WSC273 datasets (63.74% accuracy). Trichelair 
et al. [2018] have shown that this ensemble is highly inconsis- 
tent in the case of swapped candidates and mainly works well 
on the associative subset of WSC273. Radford et al. [2019] 
apply the same method to evaluate their GPT-2 language 
model and achieve 70.7% accuracy on WSC273. Melo et
al. [2020] use this method on their Portuguese version of the
Winograd Schema Challenge. They use an LSTM-based
language model, trained on text from Portuguese Wikipedia, but
only achieve chance-level performance.
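
The substitution-and-scoring procedure described above can be sketched with a toy model. The sketch assumes an add-one smoothed bigram model trained on two sentences as a stand-in for the LSTM language models used by Trinh and Le [2018]; the helper names `train_bigram_lm` and `resolve` and the tiny corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List


def train_bigram_lm(corpus: List[str]) -> Callable[[str], float]:
    """Return a function that scores a sentence with add-one smoothed bigrams."""
    vocabulary, contexts, bigrams = set(), Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        vocabulary.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(vocabulary)

    def log_prob(sentence: str) -> float:
        tokens = ["<s>"] + sentence.lower().split()
        return sum(
            math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )

    return log_prob


def resolve(sentence: str, pronoun: str, candidates: List[str],
            log_prob: Callable[[str], float]) -> str:
    """Replace the pronoun with each candidate and keep the likelier sentence."""
    scored = {
        candidate: log_prob(sentence.replace(f" {pronoun} ", f" {candidate} ", 1))
        for candidate in candidates
    }
    return max(scored, key=scored.get)


log_prob = train_bigram_lm(["we had the roof repaired", "they got the tree removed"])
schema = "the tree crashed through the roof and now I have to get it {}"
for special_word in ("repaired", "removed"):
    answer = resolve(schema.format(special_word), "it", ["the tree", "the roof"], log_prob)
    print(special_word, "->", answer)
# repaired -> the roof
# removed -> the tree
```

The toy model succeeds here only because the example is associative: the corpus happens to contain "roof repaired" and "tree removed". This mirrors the observation that such scoring works mainly on the associative subset.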


Prakash et al. [2019] extend this approach with knowledge 
hunting. They find sentences on the web that describe a sim- 
ilar situation, but may be easier to resolve. They assume that 
the pronoun refers to the same candidate. They use the same 
method as Trinh and Le [2018] to compute probabilities of 
each candidate for each pronoun. The assumption that the 
pronoun in the mined sentence and the pronoun in the Wino- 
grad schema refer to the same entity is described and imposed 
with probabilistic soft logic [Kimmig et al., 2012]. That is, all 
pronouns are resolved to the same candidate in the most prob- 
able way. The best model obtained by combining language 
models and knowledge hunting in this way achieves 71.06% 
accuracy on WSC273 and 70.17% accuracy on WSC285. 

Klein and Nabi [2019] analyse inner attention layers of a
pre-trained BERT-base language model [Devlin et al., 2019]
to find the best referent. They define a maximum attention
score, which computes how much the model has attended
to each candidate across all the layers and attention heads.
The model is evaluated on both PDP (68.3% accuracy) and
WSC273 (60.3% accuracy).
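
One plausible reading of such a score is sketched below. It assumes that, for every layer and head, only the candidate receiving the most attention from the pronoun keeps its attention value, and that the retained values are summed per candidate and normalized to sum to one. The exact definition used by Klein and Nabi [2019] may differ; the input layout `attention[layer][head][candidate]` and the function name are assumptions for this example.

```python
from typing import List


def maximum_attention_scores(attention: List[List[List[float]]]) -> List[float]:
    """Aggregate per-head attention into one normalized score per candidate.

    attention[layer][head][candidate] is the attention the pronoun pays to
    a candidate in that layer and head.
    """
    n_candidates = len(attention[0][0])
    totals = [0.0] * n_candidates
    for layer in attention:
        for head in layer:
            # Only the most-attended candidate in this head contributes.
            winner = max(range(n_candidates), key=head.__getitem__)
            totals[winner] += head[winner]
    norm = sum(totals)
    if norm == 0:
        raise ValueError("all attention values are zero")
    return [total / norm for total in totals]


# Two layers, two heads, two candidates (made-up attention values).
toy_attention = [
    [[0.5, 0.25], [0.125, 0.25]],
    [[0.75, 0.5], [0.25, 0.125]],
]
scores = maximum_attention_scores(toy_attention)
predicted = max(range(len(scores)), key=scores.__getitem__)
print(scores, predicted)
```

With these values the first candidate wins three of the four heads and is chosen as the referent.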

Kocijan et al. [2019b] adapt scores from Trinh and 
Le [2018] to masked language models, i.e., BERT [Devlin 
et al., 2019]. They additionally introduce an unsupervised 
pre-training dataset MASKEDWIKI from English Wikipedia, 
which is constructed by masking repeated occurrences of 
nouns (130M examples, downscaled to 2.4M). When fine- 
tuned on both MASKEDWIKI and DPR, BERT-large achieves 
72.5% performance on WSC273 and 74.7% on WNLI. By 
transforming WNLI examples as introduced in Section 2.4, 
this was the first model to beat the majority-class baseline. 

Several authors have used this approach to WNLI as part 
of the GLUE benchmark [Wang et al., 2019b] with best per- 
formance achieved by T5 at 94.5% [Raffel et al., 2019]. The 
improvement over Kocijan et al. [2019b] usually comes from 
more extensive pre-training, and training on the training set 
of WNLI, which was not used by Kocijan et al. [2019b], due 
to its overlap with WSC273. 

In their subsequent work, Kocijan et al. [2019a] introduce
a dataset called WIKICREM (2.4M examples), generated in
the same way as MASKEDWIKI, but restricted to masking
personal names. By pre-training on WIKICREM and
fine-tuning on other coreference datasets, they achieve 84.8%
accuracy on DPR, 71.8% on WSC273, 74.7% on WNLI, and
86.7% on PDP.

Ye et al. [2019] introduce an align, mask, and select (AMS)
pre-training method for masked language models. They find
sentences that contain entities that are directly connected in
the ConceptNet knowledge base [Speer and Havasi, 2012].
They mask one of them and train the model to pick it from
a list of candidates over other similar candidates. They
fine-tune the obtained model in the same way as Kocijan
et al. [2019b] to achieve 75.5% and 83.6% accuracy on
WSC273 and WNLI, respectively.

He et al. [2019] combine the masked token prediction
model by Kocijan et al. [2019b] with the semantic similarity
model by Wang et al. [2019c] to create a hybrid neural
network model. The combined model achieves 75.1% accuracy
on WSC285, 90.0% accuracy on PDP, and 89.0% on WNLI.
The WNLI result was achieved by using an ensemble.


A different use of the BERT language model was made by
Ruan et al. [2019], who take advantage of the BERT next
sentence prediction feature. In addition to replacing the pronoun
with a candidate, they split the sentence into two parts and
predict whether the second part semantically follows the first
one. To improve the performance, Ruan et al. [2019] encode
syntactic dependency by changing some of the attention
tensors within BERT and train on DPR. The BERT-large model
combined with all the features achieves 71.1% accuracy on
the WSC273 dataset.

Sakaguchi et al. [2020] use the same approach when
evaluating the RoBERTa baseline for the WINOGRANDE dataset;
however, they do not modify any attention layers, and train
on WINOGRANDE rather than DPR. They report achieving
79.1% accuracy on WINOGRANDE, 90.1% accuracy on
WSC273, 87.5% on PDP, 85.6% on WNLI, and 93.1% on
DPR. To this point, this is the highest performance achieved
on the WSC273 dataset by a large margin, showing the
impact of additional training data. Curiously, they report
achieving chance-level improvement on the validation set of
WINOGRANDE when training on WINOGRANDE_debiased. They
suspect that the introduced model performed well on
WINOGRANDE_all because it learned to exploit a systemic bias
within the dataset.


4 Conclusion 


In this paper, we have reviewed and compared the datasets
of Winograd schemas that have been created and the many
systems that have been developed to attempt to solve them.
Currently, the best of these systems, which exploit deep
neural networks and incorporate very large and sophisticated
pre-trained transformer models, such as fine-tuned BERT or
RoBERTa, are able to achieve accuracy rates of around 90% on
WSC273 and similar datasets.

Levesque et al. [2012] claimed that because of the use of 
twin sentences, “clever tricks involving word order or other 
features of words or groups of words will not work [emphasis 
added].” This prediction has been falsified, at least as far as 
the dataset produced with that paper is concerned. The paper 
did not anticipate the power of neural networks, the rapid ad- 
vances in natural language modelling technology resulting in 
language models like BERT, and the subtlety and complexity 
of the patterns in words that such technologies would be able 
to find and apply. 

The systems that have succeeded on the Winograd Schema 
Challenge have succeeded on the pronoun disambiguation 
task in small passages of text, but they have not demonstrated 
either the ability to perform other natural language under- 
standing tasks, or common sense. They have not demon- 
strated the ability to reliably answer simple questions about 
narrative text [Marcus and Davis, 2019] or to answer simple 
questions about everyday situations. Similarly, text generated 
using even state-of-the-art language modeling systems, such 
as GPT-2, frequently contains incoherences [Marcus, 2020]. 

The commonsense reasoning and the natural language un- 
derstanding communities require new tests, more probing 
than the Winograd Schema Challenge, but still easy to
administer and evaluate. Several tests have been proposed and
seem promising. The problem of tracking the progress of a
world model in narrative text is discussed by Marcus [2020]. 
A number of proposed replacements for the Turing Test [Mar- 
cus et al., 2016] likewise draw heavily on various forms of 
commonsense knowledge. 